Annotating Sanskrit Corpus: Adapting IL-POSTS

Authors

  • Girish Nath Jha
  • Madhav Gopal
  • Diwakar Mishra
Abstract

In this paper we present an experiment on using the hierarchical Indic Languages POS Tagset (IL-POSTS) (Baskaran et al. 2008a, 2008b), developed by Microsoft Research India (MSRI) for tagging Indian languages, to annotate a Sanskrit corpus. Sanskrit is a language with rich morphology and relatively free word order. The authors have included and excluded certain tags according to the requirements of the Sanskrit data. A revision of the IL-POSTS annotation guidelines is also presented. The authors also describe an experiment in training the tagger at MSRI and document the results.

Similar resources

Annotating Uncertainty in Hungarian Webtext

Uncertainty detection has been a popular topic in natural language processing, which manifested in the creation of several corpora for English. Here we show how the annotation guidelines originally developed for English standard texts can be adapted to Hungarian webtext. We annotated a small corpus of Facebook posts for uncertainty phenomena and we illustrate the main characteristics of such te...

Believe Me - We Can Do This! Annotating Persuasive Acts in Blog Text

This paper describes the development of a corpus of blog posts that are annotated for the presence of attempts to persuade and corresponding tactics employed in persuasive messages. We investigate the feasibility of classifying blog posts as persuasive or non-persuasive on the basis of lexical features in the text and the tactics (as provided by human annotators). Annotated tactics provide subs...

a-headers from the Aṣṭādhyāyī in Sanskrit literature from the perspective of corpus linguistics

The paper presents strategies for evaluating the influence of Pāṇini's Aṣṭādhyāyī on the vocabulary of Sanskrit. Using a corpus linguistic approach, it examines how the Pāṇinian sample words are distributed over post-Pāṇinian Sanskrit, and if we can determine any lexicographic influence of the Aṣṭādhyāyī on later Sanskrit. The primary focus of the paper lies on data exploration, becau...

An Approach for Grammatical Constructs of Sanskrit Language using Morpheme and Parts-of-Speech Tagging by Sanskrit Corpus

Sanskrit has been the oriental language of India for many thousands of years. It is the base for most of the Indian languages. Statistical processing of natural language is based on corpora (singular: corpus). A language corpus is a collection of written and spoken texts, gathered in an organized way in electronic media for the purpose of linguistic research. It pr...

Coarse Semantic Classification of Rare Nouns Using Cross-Lingual Data and Recurrent Neural Networks

The paper presents a method for WordNet supersense tagging of Sanskrit, an ancient Indian language with a corpus grown over four millennia. The proposed method merges lexical information from Sanskrit texts with lexicographic definitions from Sanskrit-English dictionaries, and compares the performance of two machine learning methods for this task. Evaluation concentrates on Vedic, the oldest lay...

Publication date: 2009